cxo/node: guard fillHead nil-deref races (closeFiller + handleDelConn)#2423
Merged
0pcom merged 1 commit intoskycoin:developfrom May 4, 2026
Merged
cxo/node: guard fillHead nil-deref races (closeFiller + handleDelConn)#24230pcom merged 1 commit intoskycoin:developfrom
0pcom merged 1 commit intoskycoin:developfrom
Conversation
skycoin#2422 silenced the two dominant TPD panics ('Finc to negative', 'short write: i/o deadline reached'). RestartCount went from ~556/6h to 2/10min. The remaining two panics are nil-derefs in pkg/cxo/node/head.go from torn-down fillHead state: 1. createFiller line 338: cr.c.String() panics when cr.c is nil. handleDelConn (line 517) sets f.p.c = nil when the source connection is removed. f.p.r remains non-zero, so the f.p == (connRoot{}) gate at handleFillingResult does NOT catch the nil-c case, and createFiller(f.p) runs with a nil source. Fix: nil-c short-circuit at the top of createFiller. The fill can't proceed without a source connection — log and return. 2. handleSuccess line 228 (and the parallel sites in handleRequest, handleRequestFailure, handleReceivedRoot): f.fc.PushBack on a nil *list.List. closeFiller (line 384) sets `f.rqo, f.fc, f.rq = nil, nil, nil`, so any in-flight Filler goroutine messages that drain after close land on a nil list. Fix: nil-guard each PushBack/PushFront site (4 sites total). Same recovery as skycoin#2422's panic→drop pattern: the closed filler's work is no longer relevant. These are the actual races; the right "real" fix would be to drain or signal-shutdown the channels before nilling state, but that's a larger structural change. Guards stop the process kill while preserving observability.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #2422. After the Finc-clamp + short-write fixes deployed, prod TPD's
RestartCountdropped from ~556/6h to 2/10min. The two remaining panics are nil-derefs inpkg/cxo/node/head.go:1.
createFillerline 338 —cr.c.String()on nil*ConnhandleDelConn(head.go:517) setsf.p.c = nilwhen the source connection drops.f.p.rremains non-zero, so thef.p == (connRoot{})zero-value gate athandleFillingResultdoes not catch the nil-c case, andcreateFiller(f.p)runs with a nil source —cr.c.String()panics on the first line.Fix: nil-c short-circuit at the top of
createFiller. The fill can't proceed without a source connection — log and return.2.
handleSuccessline 228 (and three parallel sites) —PushBackon nil*list.ListcloseFiller(head.go:384) setsf.rqo, f.fc, f.rq = nil, nil, nil. Any in-flight Filler-goroutine messages still draining throughf.successq/f.requestq/f.failureqafter the close land on a nil list and panic.Fix: nil-guard each
PushBack/PushFrontsite (4 sites:handleRequest,handleSuccess,handleRequestFailure,handleReceivedRoot's same-Seq branch). Same recovery as #2422's panic→drop pattern: the closed filler's work is no longer relevant.What this isn't
The "real" fix is to drain or signal-shutdown the channels before nilling state — that's a larger structural change to the fillHead's close protocol. These guards stop the process kill while preserving observability so a follow-up PR can address the underlying race without prod restarting every few minutes.
Test plan
go build ./...clean.go vet ./pkg/cxo/...clean.go test ./pkg/cxo/...(includingpkg/cxo/node's own 8s suite) all pass.gofmtclean.docker logs transport-discovery --since 10m | grep -c panic:reaches 0 andRestartCountstops climbing.Sequencing
Sits cleanly on top of #2422 (already merged into develop at
80fef7159).